Sports analytics: collection of relevant, historical, statistics that can provide a competitive advantage to a team or individual.
In this case study, we will analyze data from NBA and try to answer these questions
How can we know who are the best NBA players based on basketball skills (not only points)?
Do NBA players have skills that exceed their predefined positions (point guard, shooting guard, small forward, power forward, and center)?
Can we help NBA coaches to make better decisions? Are they really making right decisions?
# delete everything
rm(list=ls())
library(tidyverse)
library(GGally) # ggplot2-based visualization of correlations
library(factoextra) # ggplot2-based visualization of pcaCurrently, the NBA is composed of 30 teams (29 in the United States and 1 in Canada) with more than 500 players
Data from last season: https://www.basketball-reference.com/leagues/NBA_2022_per_game.html
The dataset contains skill performance of all the players (605) during the last season 2020-21
The glossary is in the web page. Performance is per game
players.df = read.csv(file = "NBA_2022.csv")
glimpse(players.df)## Rows: 812
## Columns: 30
## $ Rk <int> 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14, 15…
## $ Player <chr> "Precious Achiuwa\\achiupr01", "Steven Adams\\adamsst01", "Bam …
## $ Pos <chr> "C", "C", "C", "PF", "C", "SG", "SG", "SG", "SG", "C", "PG", "S…
## $ Age <int> 22, 28, 24, 21, 36, 23, 23, 23, 26, 23, 23, 28, 28, 28, 28, 27,…
## $ Tm <chr> "TOR", "MEM", "MIA", "MEM", "BRK", "TOT", "NOP", "UTA", "MIL", …
## $ G <int> 73, 76, 56, 32, 47, 65, 50, 15, 66, 56, 54, 16, 3, 13, 69, 67, …
## $ GS <int> 28, 75, 56, 0, 12, 21, 19, 2, 61, 56, 1, 6, 0, 6, 11, 67, 6, 3,…
## $ MP <dbl> 23.6, 26.3, 32.6, 11.3, 22.3, 22.6, 26.3, 9.9, 27.3, 32.3, 15.4…
## $ FG <dbl> 3.6, 2.8, 7.3, 1.7, 5.4, 3.9, 4.7, 1.1, 3.9, 6.6, 2.4, 2.3, 1.3…
## $ FGA <dbl> 8.3, 5.1, 13.0, 4.1, 9.7, 10.5, 12.6, 3.2, 8.6, 9.7, 5.4, 5.9, …
## $ FG. <dbl> 0.439, 0.547, 0.557, 0.402, 0.550, 0.372, 0.375, 0.333, 0.448, …
## $ X3P <dbl> 0.8, 0.0, 0.0, 0.2, 0.3, 1.6, 1.9, 0.7, 2.4, 0.0, 0.6, 0.9, 0.7…
## $ X3PA <dbl> 2.1, 0.0, 0.1, 1.5, 1.0, 5.2, 6.1, 2.2, 5.9, 0.2, 2.0, 3.7, 2.0…
## $ X3P. <dbl> 0.359, 0.000, 0.000, 0.125, 0.304, 0.311, 0.311, 0.303, 0.409, …
## $ X2P <dbl> 2.9, 2.8, 7.3, 1.5, 5.1, 2.3, 2.8, 0.4, 1.5, 6.6, 1.8, 1.3, 0.7…
## $ X2PA <dbl> 6.1, 5.0, 12.9, 2.6, 8.8, 5.3, 6.5, 1.0, 2.7, 9.6, 3.4, 2.3, 0.…
## $ X2P. <dbl> 0.468, 0.548, 0.562, 0.560, 0.578, 0.433, 0.434, 0.400, 0.533, …
## $ eFG. <dbl> 0.486, 0.547, 0.557, 0.424, 0.566, 0.449, 0.450, 0.438, 0.588, …
## $ FT <dbl> 1.1, 1.4, 4.6, 0.6, 1.9, 1.2, 1.4, 0.7, 1.0, 2.9, 0.7, 0.9, 1.0…
## $ FTA <dbl> 1.8, 2.6, 6.1, 1.0, 2.2, 1.7, 1.9, 0.8, 1.1, 4.2, 1.0, 1.2, 1.3…
## $ FT. <dbl> 0.595, 0.543, 0.753, 0.625, 0.873, 0.743, 0.722, 0.917, 0.865, …
## $ ORB <dbl> 2.0, 4.6, 2.4, 1.0, 1.6, 0.6, 0.7, 0.1, 0.5, 3.4, 0.5, 0.3, 0.0…
## $ DRB <dbl> 4.5, 5.4, 7.6, 1.7, 3.9, 2.3, 2.6, 1.5, 2.9, 7.3, 1.4, 2.6, 2.0…
## $ TRB <dbl> 6.5, 10.0, 10.1, 2.7, 5.5, 2.9, 3.3, 1.5, 3.4, 10.8, 1.9, 2.9, …
## $ AST <dbl> 1.1, 3.4, 3.4, 0.7, 0.9, 2.4, 2.8, 1.1, 1.5, 1.6, 2.8, 2.1, 2.0…
## $ STL <dbl> 0.5, 0.9, 1.4, 0.2, 0.3, 0.7, 0.8, 0.3, 0.7, 0.8, 1.3, 0.5, 0.3…
## $ BLK <dbl> 0.6, 0.8, 0.8, 0.3, 1.0, 0.4, 0.4, 0.3, 0.3, 1.3, 0.1, 0.4, 0.0…
## $ TOV <dbl> 1.2, 1.5, 2.6, 0.5, 0.9, 1.4, 1.7, 0.5, 0.7, 1.7, 0.7, 0.5, 0.7…
## $ PF <dbl> 2.1, 2.0, 3.1, 1.1, 1.7, 1.6, 1.8, 1.0, 1.5, 1.7, 1.4, 1.4, 1.3…
## $ PTS <dbl> 9.1, 6.9, 19.1, 4.1, 12.9, 10.6, 12.8, 3.5, 11.1, 16.1, 6.1, 6.…
hist(rowMeans(is.na(players.df)))barplot(colMeans(is.na(players.df)), las=2)Conclusions? What can we do?
We are going to skip the following variables:
Rk: Rank
Player: Player’s name
Pos: Position
Age: Player’s age on February 1 of the season
Tm: Team
G and GS: Games played and started (they depend on the coach, not the player’s skills)
MP: Minutes played per game (they depend on the coach, not the player’s skills)
nba <- players.df[, 9:ncol(players.df)]
names = players.df[, 2]
names = names %>% str_replace("\\\\.*$", "")
pos = players.df[,3]
team = players.df[,5]
gp = players.df[,6]
min = players.df[,8]
points = players.df[,30]
dim(nba)## [1] 812 22
summary(nba)## FG FGA FG. X3P
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 1.20 1st Qu.: 3.000 1st Qu.:0.3850 1st Qu.:0.2000
## Median : 2.40 Median : 5.150 Median :0.4410 Median :0.7000
## Mean : 2.87 Mean : 6.387 Mean :0.4343 Mean :0.8713
## 3rd Qu.: 3.90 3rd Qu.: 8.725 3rd Qu.:0.5000 3rd Qu.:1.4000
## Max. :11.40 Max. :21.800 Max. :1.0000 Max. :4.5000
## NA's :15
## X3PA X3P. X2P X2PA
## Min. : 0.000 Min. :0.0000 Min. :0.0 Min. : 0.000
## 1st Qu.: 0.800 1st Qu.:0.2587 1st Qu.:0.7 1st Qu.: 1.400
## Median : 2.050 Median :0.3310 Median :1.5 Median : 3.000
## Mean : 2.561 Mean :0.3034 Mean :2.0 Mean : 3.829
## 3rd Qu.: 3.900 3rd Qu.:0.3762 3rd Qu.:2.8 3rd Qu.: 5.100
## Max. :11.700 Max. :1.0000 Max. :9.5 Max. :18.300
## NA's :72
## X2P. eFG. FT FTA
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. : 0.000
## 1st Qu.:0.4507 1st Qu.:0.4640 1st Qu.:0.400 1st Qu.: 0.500
## Median :0.5160 Median :0.5170 Median :0.900 Median : 1.200
## Mean :0.5055 Mean :0.4975 Mean :1.204 Mean : 1.575
## 3rd Qu.:0.5793 3rd Qu.:0.5630 3rd Qu.:1.600 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.0000 Max. :9.600 Max. :11.800
## NA's :28 NA's :15
## FT. ORB DRB TRB
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.:0.6735 1st Qu.:0.3000 1st Qu.: 1.30 1st Qu.: 1.700
## Median :0.7650 Median :0.6000 Median : 2.30 Median : 2.900
## Mean :0.7476 Mean :0.8129 Mean : 2.52 Mean : 3.332
## 3rd Qu.:0.8460 3rd Qu.:1.1000 3rd Qu.: 3.40 3rd Qu.: 4.400
## Max. :1.0000 Max. :4.6000 Max. :11.00 Max. :14.700
## NA's :97
## AST STL BLK TOV
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.500 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.4000
## Median : 1.200 Median :0.5000 Median :0.3000 Median :0.8000
## Mean : 1.808 Mean :0.5828 Mean :0.3536 Mean :0.9787
## 3rd Qu.: 2.400 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.3000
## Max. :10.800 Max. :2.5000 Max. :2.8000 Max. :4.8000
##
## PF PTS
## Min. :0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 3.300
## Median :1.600 Median : 6.300
## Mean :1.565 Mean : 7.812
## 3rd Qu.:2.200 3rd Qu.:10.600
## Max. :5.000 Max. :30.600
##
We remove some variable as they are redundant:
The decision about the final variables to consider (input) is key and will affect the output.
We are going to consider the following variables to measure skill performance:
nba = nba[, c("X2P", "X2PA", "X3P", "X3PA", "FT", "FTA", "ORB", "DRB", "AST", "STL", "BLK", "TOV", "PF")]
dim(nba)## [1] 812 13
Our input has dimension \(p = 13\), that implies \(2^p\) different relations between the variables.
Dimension 1: univariate analysis for all 13 variables
# Enter your code here to plot the boxplots and scaled boxplotsDimension 2: bivariate analysis (scatter plots), in total 105
# Enter your code here to display a correlation graphConclusions?
Dimension 3: proportional to \(13^3\) relations, but no way to obtain information… This is why we need an analytical tool to reduce the dimension
Finally, note there are variables highly correlated, especially the most related ones (like ftm and fta)
From dimension 13 to dimension 2
pca = prcomp(nba, scale=T)
# pca = princomp(nba, cor=T) # the same, but using SVD instead of eigen decomposition
summary(pca)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.7676 1.4620 0.96913 0.81706 0.62185 0.57505 0.53443
## Proportion of Variance 0.5892 0.1644 0.07225 0.05135 0.02975 0.02544 0.02197
## Cumulative Proportion 0.5892 0.7536 0.82586 0.87722 0.90696 0.93240 0.95437
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.50433 0.40255 0.34939 0.16989 0.12057 0.10645
## Proportion of Variance 0.01957 0.01246 0.00939 0.00222 0.00112 0.00087
## Cumulative Proportion 0.97394 0.98640 0.99579 0.99801 0.99913 1.00000
Insights?
This is the same, but using mathematical format; here, eigenvalues denote variances and eigenvectors denote loadings:
R = cor(nba) # correlation matrix
eigen(R) ## eigen() decomposition
## $values
## [1] 7.65947992 2.13754026 0.93921182 0.66758529 0.38669376 0.33068023
## [7] 0.28561635 0.25434919 0.16204491 0.12206997 0.02886108 0.01453617
## [13] 0.01133104
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3283502 0.13090889 -0.2005587 0.04829601 0.15155669 -0.11847426
## [2,] -0.3338110 0.05726485 -0.2149608 0.01997509 0.15079201 -0.10792035
## [3,] -0.2165818 -0.44988703 0.3270124 0.34979052 0.09813958 -0.11673509
## [4,] -0.2220291 -0.45601676 0.3028672 0.31876463 0.09909183 -0.11644052
## [5,] -0.3151506 -0.02220766 -0.3602631 0.24802791 -0.26081680 0.22320970
## [6,] -0.3176003 0.03410196 -0.3538379 0.23496798 -0.23251370 0.20947568
## [7,] -0.1706977 0.51802347 0.1647788 0.03243788 0.43806350 -0.25224166
## [8,] -0.3064642 0.21531566 0.1210756 0.08991934 0.20035959 -0.20723908
## [9,] -0.2806262 -0.23411852 -0.1791940 -0.44990852 -0.00494676 -0.17663130
## [10,] -0.2571328 -0.13595733 0.2875929 -0.59275159 -0.26273293 -0.05941315
## [11,] -0.1984953 0.38398402 0.3809544 0.15411136 -0.68250377 -0.21239041
## [12,] -0.3202672 -0.11113691 -0.1299566 -0.24102592 0.08538040 -0.06084444
## [13,] -0.2728810 0.14063314 0.3758526 -0.10873256 0.19635650 0.81444139
## [,7] [,8] [,9] [,10] [,11]
## [1,] 0.06126662 0.544538719 -0.07174945 -0.035707018 -0.1376590598
## [2,] 0.04303944 0.543542648 -0.01279845 -0.022388792 0.1185337555
## [3,] 0.05344343 -0.016030111 0.12455926 -0.084641284 -0.6786485977
## [4,] 0.00579295 0.008854199 0.07311835 0.021738245 0.7053768001
## [5,] 0.18793768 -0.247888195 0.05616742 -0.068538871 0.0239410677
## [6,] 0.21219982 -0.245950849 0.07334150 0.006354774 0.0028134146
## [7,] 0.23485823 -0.294879836 0.51184162 -0.105619837 0.0468640306
## [8,] -0.06206350 -0.336728258 -0.79512629 0.040176377 -0.0155598338
## [9,] -0.39716825 -0.192994670 0.08975304 -0.628037364 0.0199684449
## [10,] 0.62508721 0.016156205 -0.06170887 0.117227255 -0.0009188301
## [11,] -0.32100339 0.131473087 0.10809410 -0.030908786 0.0078434368
## [12,] -0.40975843 -0.152064699 0.21801742 0.743916295 -0.0734002718
## [13,] -0.18089462 0.086639395 -0.00244457 -0.102661750 0.0033532861
## [,12] [,13]
## [1,] 0.6509532697 -0.214704267
## [2,] -0.6706093068 0.206338548
## [3,] -0.1008515625 0.040220826
## [4,] 0.1487045165 -0.020845168
## [5,] -0.2051007099 -0.666161151
## [6,] 0.2175727792 0.679210925
## [7,] -0.0156227616 -0.033075890
## [8,] -0.0497566704 0.009299392
## [9,] 0.0419091939 0.032480254
## [10,] 0.0027552474 -0.014898604
## [11,] -0.0168137152 0.006779637
## [12,] 0.0093413656 -0.040742026
## [13,] 0.0009904054 0.003574908
screeplot(pca,main="Screeplot",col="blue",type="barplot",pch=19)Nicer with factoextra package:
fviz_screeplot(pca, addlabels = TRUE)Note with 2 components we explain 75% of variability
First component:
barplot(pca$rotation[,1], las=2, col="darkblue")Any hint for the meaning of the 1st PC?
Note the sum of the squared loadings (eigenvectors) is equal to 1
sum(pca$rotation[,1]^2)## [1] 1
That means squared loadings are easier to interpret than the loadings
I.e. they are like percentages (numbers between 0 and 1)
So let’s plot squared loadings instead
They are called contribution of variables to components
fviz_contrib(pca, choice = "var", axes = 1)The red dashed line on the graph above indicates the expected average contribution
If the contribution of the variables were uniform, the expected value would be 1/length(variables) = 1/15 = 6.6%
Now we can rank the players by their first PC scores: best historical players in terms of performance:
# The worst
names[order(pca$x[,1])][(length(names)-5):length(names)]## [1] "Sam Dekker" "Jaime Echenique" "Tyler Hall" "DeJon Jarreau"
## [5] "Juwan Morgan" "Ade Murkey"
# The best
names[order(pca$x[,1])][1:10]## [1] "Joel Embiid" "Giannis Antetokounmpo" "Nikola Jokić"
## [4] "Luka Dončić" "James Harden" "James Harden"
## [7] "LeBron James" "Kevin Durant" "James Harden"
## [10] "Trae Young"
Nikola Jokić was the Most Valuable Player in the season: https://www.basketball-reference.com/leagues/NBA_2021_leaders.html
Second component:
barplot(pca$rotation[,2], las=2, col="darkblue")Any insight about this component? Forward vs Guard?
Maybe we can get more insights by ranking the players using this component:
names[order(pca$x[,2])][1:6]## [1] "Stephen Curry" "Fred VanVleet" "Trae Young" "Damian Lillard"
## [5] "Donovan Mitchell" "Darius Garland"
names[order(pca$x[,2])][(length(names)-5):length(names)]## [1] "Anthony Davis" "Clint Capela" "Jakob Poeltl"
## [4] "Mitchell Robinson" "Robert Williams" "Rudy Gobert"
Contribution of variables to second component:
Take care because we loose the sign (to get contribution in percentage)
fviz_contrib(pca, choice = "var", axes = 2)Once we have interpreted the meaning of the first two components, let’s see the contribution of each player to components
For the \(i\)-th player and first component, the contribution is: \(z_{1,i}^2 / \lambda_1 / n\), which is a number between 0 and 1
head(get_pca_ind(pca)$contrib[,1]) # this is in %, that is between 0 and 100## 1 2 3 4 5 6
## 0.02361101 0.09512845 0.69160618 0.04284392 0.03938898 0.01493083
head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(nba)[1] # this is between 0 and 1## [1] 0.0002361101 0.0009512845 0.0069160618 0.0004284392 0.0003938898
## [6] 0.0001493083
Let’s visualize the top-100 players contributions to first component (global performance):
fviz_contrib(pca, choice = "ind", axes = 1, top=100)Let’s see the first names:
names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)][1:10]## [1] "Joel Embiid" "Giannis Antetokounmpo" "Nikola Jokić"
## [4] "Luka Dončić" "James Harden" "James Harden"
## [7] "LeBron James" "Kevin Durant" "James Harden"
## [10] "Trae Young"
# It is very similar to names[order(pca$x[,1])][1:10] but in percentageFinally, let’s make a zoom to see the top-20 players in contributions:
names_z1 = names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 1, top=20)+scale_x_discrete(labels=names_z1)Biplot: observations and variables in same graph (using first 2 components)
biplot(pca)Not informative in this case: too many players
Nicer and using contributions (instead of loadings), without players:
fviz_pca_var(pca, col.var = "contrib")Nicer but again too much information:
fviz_pca_biplot(pca, repel = TRUE)Remember, for the \(j\)-th principal component: \(Z_j = X a_j\), \(a_j\) denotes the loadings, and \(Z_j\) denotes the scores
Let’s plot the first two scores, using colors for minutes played:
data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=names,color=min)) + geom_point(size=0) +
labs(title="PCA", x="PC1", y="PC2") +
theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) The first two PCs seem independent, but this is not always the same. What is true is that they are always uncorrelated.
The first component is highly correlated with minutes (decided by coaches)
The same, but using colors for games played:
data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=names,color=gp)) + geom_point(size=0) +
labs(title="PCA", x="PC1", y="PC2") +
theme_bw() + scale_color_gradient(low="yellow", high="darkred")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) Insights?
Which are the teams with the best players?
data.frame(z1=-pca$x[,1],team=team) %>% group_by(team) %>% summarise(mean=mean(z1)) %>% arrange(desc(mean))## # A tibble: 31 × 2
## team mean
## <chr> <dbl>
## 1 POR 1.09
## 2 OKC 0.735
## 3 BRK 0.728
## 4 HOU 0.610
## 5 IND 0.573
## 6 GSW 0.516
## 7 DET 0.356
## 8 NOP 0.330
## 9 ORL 0.323
## 10 LAL 0.321
## # … with 21 more rows
Portland Trail Blazers has the best overall players
Other view: are the better players playing more minutes in a game?
data.frame(z1=-pca$x[,1],z2=min) %>%
ggplot(aes(z1,z2,label=names,color=gp)) + geom_point(size=0) +
labs(title="Performance", x="PC1", y="Minutes per game") +
scale_color_gradient(low="lightblue", high="darkblue") +
theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)Yes, but non-linear… why?
Are the better players playing more games?
data.frame(z1=-pca$x[,1],z2=gp) %>%
ggplot(aes(z1,z2,label=names,color=min)) + geom_point(size=0) +
labs(title="Performance", x="PC1", y="Games played") +
scale_color_gradient(low="lightblue", high="darkblue") +
theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)Somehow, but not too much… why?